1 Introduction

Genome-wide association studies (GWAS) can reveal important genotype-to-phenotype associations; however, data quality and interpretability issues must be addressed. The GWAX approach enables rational ranking, filtering, and interpretation of GWAS results via metrics, methods, and interactive visualization. Each inferred gene-to-trait association is evaluated for confidence and relevance, with scores derived solely from aggregated statistics linking a protein-coding gene and a phenotype. Applicability and thresholds will depend on the use case.

1.1 About NHGRI-EBI GWAS Catalog

Each study in the GWAS Catalog (http://www.ebi.ac.uk/gwas/) has a study_accession. Each study is also associated with a publication (PubMed ID), but not uniquely: one publication may report several studies. For column definitions, see https://www.ebi.ac.uk/gwas/docs/fileheaders.

Some key definitions:

`MAPPED GENE(S)`: Gene(s) mapped to the strongest SNP. If the SNP is located
within a gene, that gene is listed. If the SNP is intergenic, the upstream
and downstream genes are listed, separated by a hyphen.

`REPORTED GENE(S)`: Gene(s) reported by author.

`OR or BETA`: Reported odds ratio or beta-coefficient associated with
strongest SNP risk allele. Note that if an OR <1 is reported this is
inverted, along with the reported allele, so that all ORs included in
the Catalog are >1. Appropriate unit and increase/decrease are included
for beta coefficients.

Reference:

  • Welter D, MacArthur J, Morales J, Burdett T, Hall P, Junkins H, Klemm A, Flicek P, Manolio T, Hindorff L, and Parkinson H. The NHGRI GWAS Catalog, a curated resource of SNP-trait associations. Nucleic Acids Research, 2014, Vol. 42 (Database issue): D1001-D1006.

## [1] "Mon Jun 10 12:06:32 2019"

2 Read files

Input files are read from the GWAS Catalog, TCRD (Target Central Resource Database), and EFO (Experimental Factor Ontology).

3 Genome Wide Association Studies (GWAS)

3.0.1 Counts by year

## [1] "Studies total: 5774 ; accessions: 5774 ; traits: 3472 ; PMIDs: 3596"

Laboratory platforms

Platforms are grouped by vendor (the first listed when a study reports several). Note that a given vendor's technologies may have changed over time.

3.0.2 Journals

Top journals by N_gwas

| JOURNAL | N_gwas | N_pmid | N_assn |
|---|---|---|---|
| Nat Genet | 797 | 522 | 16177 |
| PLoS One | 383 | 226 | 4966 |
| PLoS Genet | 352 | 178 | 6419 |
| Hum Mol Genet | 352 | 243 | 4154 |
| Nat Commun | 307 | 128 | 7928 |
| Am J Hum Genet | 162 | 85 | 3238 |
| Sci Rep | 149 | 79 | 1003 |
| Mol Psychiatry | 138 | 98 | 1970 |
| Circ Cardiovasc Genet | 88 | 54 | 623 |
| Am J Med Genet B Neuropsychiatr Genet | 80 | 47 | 1070 |
| Diabetes | 72 | 39 | 463 |
| Hum Genet | 72 | 51 | 1693 |

4 iCite Relative Citation Ratio (RCR)

RCR annotations were retrieved from the iCite API for all PMIDs in the GWAS Catalog. New publications may lack an iCite RCR. Open question: should missing values be imputed with the median RCR as a reasonable prior?

## [1] "N_pmid = 5774"                          
## [2] "mean = 4.2 ; median = 2.0 ; max = 161.8"
## [3] "90%ile = 8.5"                           
## [4] "(Plot truncated at 25.)"
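
The median-imputation idea raised above could be sketched as follows. This is a minimal Python illustration, not code from the GWAX workflow; the function name `impute_rcr`, the dictionary layout, and the example PMIDs and values are all assumptions made for the example.

```python
from statistics import median


def impute_rcr(rcr_by_pmid):
    """Return a copy in which missing RCR values (None) are replaced
    by the median of the observed values."""
    observed = [v for v in rcr_by_pmid.values() if v is not None]
    if not observed:
        raise ValueError("no observed RCR values to impute from")
    fill = median(observed)
    return {pmid: (fill if v is None else v) for pmid, v in rcr_by_pmid.items()}


# Hypothetical PMIDs and RCR values, for illustration only.
example = {"pmid_a": 1.0, "pmid_b": 2.0, "pmid_c": 9.0, "pmid_new": None}
imputed = impute_rcr(example)
print(imputed["pmid_new"])  # median of the three observed values
```

Because the RCR distribution reported above is right-skewed (mean 4.2, median 2.0), the median is a more conservative fill value than the mean.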

5 Associations (SNP to trait)

## [1] "Associations total: 87601 ; SNPs: 59580 ; traits: 2970 ; PMIDs: 3110"

6 SNP to gene mappings

## [1] "snp2gene: total associations: 266000 ; studies: 4902 ; snps: 60099 ; genes: 22392 ; intergenic associations: 5145 ; chromosomal location associations: 38630"

| REPORTED_OR_MAPPED | N |
|---|---|
| reported | 116747 |
| mapped_upstream | 38759 |
| mapped_downstream | 38759 |
| mapped_within | 71735 |
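
The `MAPPED GENE(S)` convention from section 1.1 (a single gene when the SNP lies within it, an upstream-downstream pair when it is intergenic) can be split into the categories counted above roughly as follows. This is a sketch under stated assumptions: that intergenic pairs are delimited by a hyphen surrounded by spaces (so hyphenated symbols such as HLA-DRA are not split) and that multiple entries are comma-separated. The function name and the symbols `GENE1` and `GENE2` are hypothetical.

```python
def parse_mapped_gene(field):
    """Split one MAPPED_GENE value into (relation, gene symbol) pairs.

    Assumes " - " (with spaces) separates upstream and downstream genes
    and "," separates multiple entries.
    """
    pairs = []
    for entry in field.split(","):
        entry = entry.strip()
        if not entry:
            continue
        if " - " in entry:
            upstream, downstream = entry.split(" - ", 1)
            pairs.append(("mapped_upstream", upstream.strip()))
            pairs.append(("mapped_downstream", downstream.strip()))
        else:
            pairs.append(("mapped_within", entry))
    return pairs


within = parse_mapped_gene("HLA-DRA")
intergenic = parse_mapped_gene("GENE1 - GENE2")  # hypothetical symbols
print(within)
print(intergenic)
```

Under this reading, every intergenic SNP contributes one upstream and one downstream row, which is consistent with the equal mapped_upstream and mapped_downstream counts reported above.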

7 Gene counts

## [1] "Studies: 4902"
## [1] "MAPPED_GENE values: 22750"
## [1] "REPORTED_GENE values: 18396"
## [1] "TCRD targets: 19947 ; geneSymbols: 19736"
## [1] "GSYMBs mapped to TCRD: 13725"
## [1] "Tbio: 7995"  "Tchem: 1513" "Tclin: 512"  "Tdark: 3831"

8 Gene-SNP-Study-Trait (G2T) associations

The g2t table should have exactly one row for each gene-SNP-study-trait association.

## [1] "GTs with pvalue_mlog, g2t: 242998 ; genes: 20681 ; traits: 1601"
## [1] "GTs with or_or_beta, g2t: 242998 ; genes: 20681 ; traits: 1601"
## [1] "GTs with oddsratio, g2t: 60065 ; genes: 12177 ; traits: 913"
## [1] "GTs with beta, g2t: 137631 ; genes: 12407 ; traits: 767"
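
The one-row-per-association expectation for g2t can be checked with a simple duplicate-key scan. The sketch below is illustrative only: the column names in `KEY_FIELDS` (other than `study_accession`, which the GWAS Catalog provides) and the example rows are hypothetical.

```python
from collections import Counter

# Hypothetical column names for the gene-SNP-study-trait key.
KEY_FIELDS = ("gene", "snp", "study_accession", "trait")


def duplicated_keys(rows):
    """Return the sorted list of key tuples that occur in more than one row."""
    counts = Counter(tuple(row[f] for f in KEY_FIELDS) for row in rows)
    return sorted(key for key, n in counts.items() if n > 1)


# Hypothetical rows; the last one repeats the first key.
rows = [
    {"gene": "GENE1", "snp": "rs1", "study_accession": "S1", "trait": "T1"},
    {"gene": "GENE1", "snp": "rs2", "study_accession": "S1", "trait": "T1"},
    {"gene": "GENE1", "snp": "rs1", "study_accession": "S1", "trait": "T1"},
]
dups = duplicated_keys(rows)
print(dups)
```

An empty result means the table satisfies the one-row-per-association requirement.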

9 Traits and EFO

EFO is the Experimental Factor Ontology. It includes classes from Orphanet, PO, Mondo, and Uberon. The TSV used here was derived from the source OWL file.

## [1] "EFO total classes: 29085"

EFO sources, top 5

| Ontology | N_in_gwas | N_total |
|---|---|---|
| EFO | 1719 | 9642 |
| GO | 67 | 348 |
| HP | 46 | 474 |
| Orphanet | 40 | 5989 |
| CHEBI | 2 | 1317 |

9.1 GWAS trait-subclass relationships

## [1] "EFO classes: 29085 ; total subclass relationships: 48420"
## [1] "GWAS trait-subclass pairs: 1280"

Top EFO trait-subclass pairs

| trait_id | trait_name | subclass_id | subclass_name | trait_N_gwas | subclass_N_gwas |
|---|---|---|---|---|---|
| EFO_0004340 | body mass index | EFO_0005937 | longitudinal BMI measurement | 96 | 12 |
| EFO_0004340 | body mass index | EFO_0005935 | overweight body mass index status | 96 | 3 |
| EFO_0004340 | body mass index | EFO_0005936 | underweight body mass index status | 96 | 2 |
| EFO_0004340 | body mass index | EFO_0005851 | height-adjusted body mass index | 96 | 1 |
| EFO_0004340 | body mass index | EFO_0007041 | obese body mass index status | 96 | 1 |
| EFO_0000692 | schizophrenia | EFO_0004609 | treatment refractory schizophrenia | 94 | 6 |
| EFO_0000305 | breast carcinoma | EFO_1000649 | estrogen-receptor positive breast cancer | 77 | 14 |
| EFO_0000305 | breast carcinoma | EFO_1000650 | estrogen-receptor negative breast cancer | 77 | 10 |
| EFO_0000305 | breast carcinoma | EFO_1002010 | TP53 Positive Breast Carcinoma | 77 | 1 |
| EFO_0000249 | Alzheimer’s disease | EFO_1001870 | late-onset Alzheimers disease | 74 | 2 |
| EFO_0004612 | high density lipoprotein cholesterol measurement | EFO_0007805 | HDL cholesterol change measurement | 70 | 3 |
| EFO_0000270 | asthma | EFO_0004591 | childhood onset asthma | 60 | 12 |
| EFO_0000270 | asthma | EFO_1002011 | adult onset asthma | 60 | 3 |
| EFO_0004530 | triglyceride measurement | EFO_0007681 | triglyceride change measurement | 59 | 5 |
| EFO_0005842 | colorectal cancer | EFO_1000657 | rectum cancer | 59 | 2 |
| EFO_0005842 | colorectal cancer | EFO_1001480 | metastatic colorectal cancer | 59 | 2 |
| EFO_0001663 | prostate carcinoma | EFO_0000196 | metastatic prostate cancer | 56 | 2 |
| EFO_0004611 | low density lipoprotein cholesterol measurement | EFO_0007804 | LDL cholesterol change measurement | 55 | 4 |
| EFO_0000685 | rheumatoid arthritis | EFO_0003898 | ankylosing spondylitis | 46 | 7 |
| EFO_0000685 | rheumatoid arthritis | EFO_0003778 | psoriatic arthritis | 46 | 4 |
| EFO_0003923 | bone density | EFO_0007701 | spine bone mineral density | 45 | 15 |
| EFO_0001645 | coronary heart disease | EFO_0000378 | coronary artery disease | 45 | 14 |
| EFO_0003923 | bone density | EFO_0007702 | hip bone mineral density | 45 | 9 |
| EFO_0003923 | bone density | EFO_0007933 | radius bone mineral density | 45 | 3 |

9.3 EFO to DOID (Disease Ontology ID)

Mappings are from the EBI Ontology Xref Service (OxO). One-to-many and many-to-one mappings exist. Only the closest mappings are kept, with a maximum distance of 2.

## [1] "GWAS EFO_IDs (Total): 326"
## [1] "GWAS EFO_ID to DO_ID mappings (distance<=2): EFO_IDs: 104, DO_IDs: 230"
## [1] "GWAS EFO_ID to DO_ID mappings (efo_name=doid_name): EFO_IDs: 65, DO_IDs: 65"
## [1] "GWAS EFO_ID to DO_ID mappings (distance=1): EFO_IDs: 99, DO_IDs: 109"
## [1] "GWAS EFO_ID to DO_ID mappings (distance=2): EFO_IDs: 53, DO_IDs: 126"

EBI OxO EFO-DOID (sample)

| efo_id | do_id | efo_name | doid_name | distance |
|---|---|---|---|---|
| EFO_0003914 | DOID_3393 | atherosclerosis | coronary artery disease | 2 |
| EFO_0001645 | DOID_3393 | coronary heart disease | coronary artery disease | 1 |
| EFO_0000692 | DOID_0080281 | schizophrenia | schizophrenia 19 | 2 |
| EFO_0003890 | DOID_302 | drug dependence | substance abuse | 2 |
| EFO_0004253 | DOID_585 | nephrolithiasis | nephrolithiasis | 1 |
| EFO_0000574 | DOID_8675 | lymphoma | lymphosarcoma | 2 |
| EFO_0000178 | DOID_5517 | gastric carcinoma | stomach carcinoma | 1 |
| EFO_0000378 | DOID_3145 | coronary artery disease | hyperlipoproteinemia type III | 2 |
| EFO_0000270 | DOID_9415 | asthma | allergic asthma | 2 |
| EFO_0000305 | DOID_3459 | breast carcinoma | breast carcinoma | 1 |
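
The distance-based filtering of OxO mappings described in this section can be illustrated with a short Python sketch. It is not the report's own code: the function names and dictionary layout are assumptions, and two of the example rows are hypothetical (marked in the comments). The counts above include mappings at both distance 1 and distance 2, which suggests the report keeps every mapping within the maximum distance; `closest_per_efo` shows a stricter alternative reading that keeps only the minimum-distance mappings for each EFO ID.

```python
def filter_by_distance(mappings, max_distance=2):
    """Keep mappings whose OxO distance does not exceed max_distance."""
    return [m for m in mappings if m["distance"] <= max_distance]


def closest_per_efo(mappings):
    """Stricter alternative: keep only minimum-distance mappings per EFO ID."""
    best = {}
    for m in mappings:
        current = best.get(m["efo_id"])
        if current is None or m["distance"] < current:
            best[m["efo_id"]] = m["distance"]
    return [m for m in mappings if m["distance"] == best[m["efo_id"]]]


mappings = [
    {"efo_id": "EFO_0003914", "do_id": "DOID_3393", "distance": 2},
    {"efo_id": "EFO_0001645", "do_id": "DOID_3393", "distance": 1},
    {"efo_id": "EFO_0000270", "do_id": "DOID_9415", "distance": 2},
    # The two rows below are hypothetical, added to show one-to-many handling.
    {"efo_id": "EFO_0000270", "do_id": "DOID_HYPOTHETICAL_1", "distance": 1},
    {"efo_id": "EFO_0000270", "do_id": "DOID_HYPOTHETICAL_2", "distance": 3},
]
kept = filter_by_distance(mappings, max_distance=2)
closest = closest_per_efo(kept)
print(len(kept), len(closest))
```

The first two rows show a many-to-one case (two EFO IDs mapping to DOID_3393), which both functions preserve.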

10 GENE-TRAIT stats

Read gt_stats.tsv, built by gwax_gt_stats.R for GWAX. These statistics are designed to weigh evidence aggregated across studies for each gene-trait association.

## [1] "nrow(gt) = 33977"
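
To make the aggregation concrete, the following Python sketch computes a few per-gene-trait statistics from g2t-style rows. It does not reproduce gwax_gt_stats.R, whose logic is not shown here; the input column names (`gene`, `trait`, `snp`, `oddsratio`, `pvalue_mlog`) and the example rows are assumptions, and the `rcras` score is omitted because its definition is not given in this report.

```python
from collections import defaultdict
from statistics import median


def gt_stats(g2t_rows):
    """Aggregate gene-SNP-study-trait rows into one record per (gene, trait)."""
    groups = defaultdict(list)
    for row in g2t_rows:
        groups[(row["gene"], row["trait"])].append(row)
    stats = {}
    for key, rows in groups.items():
        # Odds ratios may be missing (e.g. when only a beta is reported).
        ors = [r["oddsratio"] for r in rows if r["oddsratio"] is not None]
        stats[key] = {
            "n_study": len({r["study_accession"] for r in rows}),
            "n_snp": len({r["snp"] for r in rows}),
            "or_median": median(ors) if ors else None,
            "pvalue_mlog_median": median(r["pvalue_mlog"] for r in rows),
        }
    return stats


# Hypothetical rows, for illustration only.
rows = [
    {"gene": "GENE1", "trait": "T1", "study_accession": "S1", "snp": "rs1",
     "oddsratio": 1.25, "pvalue_mlog": 8.0},
    {"gene": "GENE1", "trait": "T1", "study_accession": "S2", "snp": "rs1",
     "oddsratio": 1.75, "pvalue_mlog": 10.0},
    {"gene": "GENE1", "trait": "T1", "study_accession": "S2", "snp": "rs2",
     "oddsratio": None, "pvalue_mlog": 6.0},
]
result = gt_stats(rows)[("GENE1", "T1")]
print(result)
```

Counting distinct studies and SNPs, rather than rows, keeps a single study with many SNPs from inflating the evidence count.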

10.0.1 Focus on traits with more data and evidence, likely high scientific interest.

Most highly studied traits

| trait | trait_ids | N_genes |
|---|---|---|
| schizophrenia | EFO_0000692 | 1716 |
| intelligence | EFO_0004337 | 1232 |
| autism spectrum disorder | EFO_0003756 | 886 |
| prostate carcinoma | EFO_0001663 | 549 |
| inflammatory bowel disease | EFO_0003767 | 527 |
| systemic lupus erythematosus | EFO_0002690 | 507 |
| age at onset | EFO_0004847 | 481 |
| Crohn’s disease | EFO_0000384 | 468 |
| lung carcinoma | EFO_0001071 | 451 |
| Alzheimer’s disease | EFO_0000249 | 433 |
| bipolar disorder | EFO_0000289 | 422 |
| asthma | EFO_0000270 | 418 |

11 Prototype GWAX web app.

11.1 Plot single trait with all associated genes.

  • X-axis: Evidence (n_study)
  • Y-axis: Effect (median(OR))
  • Other considerations:
      • n_traits_this_gene: a low value is normally preferred, but this depends on the other traits and their semantics/ontology.
      • n_snp: the number of SNPs; it is unclear whether more or fewer is better.
      • pval_median: interpretation may be a challenge.
  • Note/issue: beta values are ignored because varying units make them non-comparable. (They could perhaps be aggregated where units agree.)

Unmapped genes are colored gray.

Plot for a selected trait:

## [1] "http://www.ebi.ac.uk/efo/EFO_0002508: Parkinson's disease"

11.1.1 Pareto filter

The Pareto filter selects N non-dominated solutions on the two-dimensional multi-objective boundary.
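
A point is non-dominated when no other point is at least as good on both objectives and strictly better on at least one. The Python sketch below applies that definition to evidence (n_study) and effect (or_median), both maximized, using the Parkinson's disease top hits listed in section 11.1.3. It is an illustration rather than the web app's implementation; the function name is an assumption, and selecting a fixed number N of points (for example by peeling successive fronts) is not shown.

```python
def pareto_front(points):
    """Return labels of points not dominated when maximizing both x and y.

    points: list of (label, x, y) tuples.
    """
    front = []
    for label, x, y in points:
        dominated = any(
            (ox >= x and oy >= y) and (ox > x or oy > y)
            for _, ox, oy in points
        )
        if not dominated:
            front.append(label)
    return front


# (gsymb, n_study, or_median) taken from the top-hits table in section 11.1.3.
hits = [
    ("C16orf75", 7, 1.270), ("MHC", 6, 1.240), ("GAK", 5, 1.260),
    ("KIAA1267", 3, 1.272), ("CCHCR1", 3, 1.390), ("PRDM15", 3, 1.143),
    ("ITGA8", 3, 1.330), ("GLTSCR1L", 3, 1.190), ("HLA-DRA", 2, 1.310),
    ("PLEK", 2, 1.350),
]
front = pareto_front(hits)
print(front)  # ['C16orf75', 'CCHCR1']
```

On these ten rows only C16orf75 (most studies) and CCHCR1 (largest median odds ratio) survive; every other gene is matched or beaten on both axes by one of the two.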

11.1.2 Filter for (1) Evidence (n_study) and (2) effect (or_median).

11.1.3 Plot

Top hits:

GWAX: Parkinson’s disease, top hits

| gsymb | name | fam | tdl | n_study | rcras | or_median | pvalue_mlog_median |
|---|---|---|---|---|---|---|---|
| C16orf75 | NA | NA | NA | 7 | 24.304 | 1.270 | 43.000 |
| MHC | NA | NA | NA | 6 | 17.928 | 1.240 | 16.398 |
| GAK | Cyclin-G-associated kinase | Kinase | Tchem | 5 | 19.635 | 1.260 | 50.000 |
| KIAA1267 | NA | NA | NA | 3 | 9.586 | 1.272 | 28.000 |
| CCHCR1 | Coiled-coil alpha-helical rod protein 1 | NA | Tbio | 3 | 9.508 | 1.390 | 6.000 |
| PRDM15 | PR domain zinc finger protein 15 | TF; Epigenetic | Tdark | 3 | 12.415 | 1.143 | 23.097 |
| ITGA8 | Integrin alpha-8 | NA | Tbio | 3 | 10.556 | 1.330 | 5.523 |
| GLTSCR1L | GLTSCR1-like protein | NA | Tdark | 3 | 10.059 | 1.190 | 7.523 |
| HLA-DRA | HLA class II histocompatibility antigen, DR alpha chain | NA | Tbio | 2 | 5.25 | 1.310 | 9.301 |
| PLEK | Pleckstrin | NA | Tbio | 2 | 4.247 | 1.350 | 5.699 |